Ancient Printed Documents Indexation: A New Approach

Authors

  • Nicholas Journet
  • Rémy Mullot
  • Jean-Yves Ramel
  • Véronique Eglin
Abstract

Based on a study of the specific features of historical printed books and of the main sources of error in classical page layout analysis methods, this paper presents a new way to index ancient printed documents. We have developed an approach based on extracting and quantifying the various orientations present in printed document images. The documents are first split into homogeneous areas, in which we analyze the significant orientations with a directional rose. Each kind of information (textual or graphical) is identified and labelled according to its orientation distribution. This choice of characterization allows us to separate textual regions from graphical ones while minimizing a priori knowledge. Our proposal is evaluated through document image retrieval based on layout extraction criteria; the approach can also be used to precisely localize graphical parts in various types of documents. The system has been tested successfully on several ancient printed books of the Renaissance.
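To make the idea of a directional rose concrete, here is a minimal sketch of an orientation histogram computed over one image area. The abstract does not say how the authors compute their rose, so this uses a gradient-weighted histogram as a stand-in; the function name `directional_rose`, the `bins` parameter, and the list-of-lists grayscale input are all assumptions made for illustration.

```python
import math


def directional_rose(patch, bins=8):
    """Return a normalized histogram of stroke orientations in [0, pi).

    Hypothetical helper: `patch` is a list of rows of grayscale values.
    This is a gradient-based stand-in, not the method of the paper.
    """
    height = len(patch)
    width = len(patch[0]) if height else 0
    rose = [0.0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Central differences approximate the intensity gradient.
            gx = patch[y][x + 1] - patch[y][x - 1]
            gy = patch[y + 1][x] - patch[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # A stroke runs perpendicular to the gradient; fold into [0, pi).
            stroke_angle = (math.atan2(gy, gx) + math.pi / 2) % math.pi
            # Wrap the index so an angle rounding up to pi lands in bin 0.
            rose[int(stroke_angle / math.pi * bins) % bins] += magnitude
    total = sum(rose)
    return [value / total for value in rose] if total else rose


if __name__ == "__main__":
    # Intensity grows from top to bottom, so the structure is horizontal.
    horizontal = [[float(y)] * 6 for y in range(6)]
    print(directional_rose(horizontal))
```

On the ramp above, all the weight falls in the first bin (orientation 0, horizontal). An area of printed text would be expected to concentrate its weight in a few bins, while a drawing would spread it more widely, but any decision rule built on that is left open here because the abstract gives none.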

Similar resources

Indexation des documents XML : Un DataGuide annoté avec un index de contenu

Indexing in classical information retrieval offers few tools for handling semi-structured documents: document representations in information retrieval were conceived for flat, homogeneous documents. They are not suited to the simultaneous treatment of structure and content. Several approaches to indexing semi-structured data were proposed to resolve this new chal...

Morphological Document Recovery in HSI Space

Old documents frequently exhibit digitization errors, uneven backgrounds, bleed-through effects, etc. Motivated by the challenge of improving printed and handwritten text, we developed a new approach based on morphological color operators in the HSI color space. Our approach is composed of a morphological background estimation for foreground/background separation and text segmentation, a back...

A proposition of a robust system for historical document images indexation

Characterizing noisy or ancient documents remains a challenging problem. Many techniques have been proposed for feature extraction and image indexation of such documents. Global approaches are in general less robust and less exact than local approaches. We therefore propose in this paper a hybrid system based on a global approach (fractal dimension) and a local one, based on S...

Indexation conceptuelle par propagation. Application à un corpus d'articles scientifiques liés au cancer

Concept-based information retrieval is known to be a powerful and reliable process. However, the need for a semantically annotated corpus and its respective data structure (e.g. a domain ontology) can be problematic. The design and enlargement of a semantic index is a tedious task, which needs to be addressed. We previously suggested an annotation propagation approach in a vector space rep...

Cleaning of Ancient Document Images Using Modified Iterative Global Threshold

Ancient document image processing is an important area that has recently attracted many researchers. Binarization is the first step in cleaning a document for further processing. Depending on the degradation of the original document, either global or local thresholding methods are preferred. Thresholding is a simple and practical approach to identify the cluster of pixels that a...

Publication date: 2005